Introduce OnDiskGraphIndexCompactor and PQRetrainer for streaming N:1 merging of on-disk HNSW indexes without full in-memory materialization. Supports deletion filtering via live-node bitsets, custom ordinal mapping, and PQ codebook retraining.
Tests for OnDiskGraphIndexCompactor covering basic compaction, deletions, ordinal remapping, multi-source merging, and FusedPQ compaction scenarios.
Add JFR recording, system stats collection, JSONL logging, git info capture, thread allocation tracking, dataset partitioning, and cloud storage layout utilities used by CompactorBenchmark. Switch jvector-examples logging from logback to log4j2 for consistency with benchmarks-jmh and to avoid duplicate SLF4J bindings in the fat jar.
JMH-based benchmark with configurable workload modes (PARTITION_AND_COMPACT, PARTITION_ONLY, COMPACT_ONLY, BUILD_FROM_SCRATCH), recall measurement, JFR recording, and JSONL result logging. Includes BenchmarkParamCounter for progress tracking, EventLogAnalyzer for post-run analysis, GHA workflow, and exec-maven-plugin integration. Add forced vectorization provider property to VectorizationProvider for benchmark reproducibility.
Add result file patterns to .gitignore, update rat-excludes for the new compaction workflow and catalog cache files.
The benchmarks-jmh-*.jar glob matched the -javadoc jar first, which has no Main-Class. Select the shaded JMH jar explicitly by excluding -javadoc and -sources jars.
Use -cp with CompactorBenchmark.main() instead of -jar with JMH Main to avoid BenchmarkList discovery issues in CI's shaded jar.
| - Compaction: build source partitions with PQ; compact using FusedPQ with FP rescoring; search using FusedPQ with FP reranking. | ||
| | Dataset | Dim | Build from Scratch | Compaction | Delta | | ||
| |----------------------|-----:|-------------------:|-----------:|-------:| |
Do we need to modify this section for brevity?
I revised to the following:
Recall comparison (results averaged over three runs):
- Build from scratch: build one index over the full dataset with PQ scoring; search using FusedPQ with FP reranking.
- Compaction: partition the dataset into 4 source indexes (Fibonacci distribution), build each with PQ scoring, then compact into one index; search using FusedPQ with FP reranking.
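The partitioning step can be pictured with a small sketch. It assumes that "Fibonacci distribution" means partition sizes proportional to the Fibonacci weights 1, 1, 2, 3; the thread does not spell this out, and `FibonacciPartitionSketch` and `partitionSizes` are hypothetical names, not part of the benchmark.

```java
import java.util.Arrays;

// Hypothetical helper, not taken from CompactorBenchmark.
final class FibonacciPartitionSketch {
    /**
     * Splits `total` vectors into `parts` partitions whose sizes are
     * proportional to the Fibonacci weights 1, 1, 2, 3, ...
     * (an assumed reading of "Fibonacci distribution").
     */
    static int[] partitionSizes(int total, int parts) {
        long[] weights = new long[parts];
        long weightSum = 0;
        for (int i = 0; i < parts; i++) {
            weights[i] = i < 2 ? 1 : weights[i - 1] + weights[i - 2];
            weightSum += weights[i];
        }
        int[] sizes = new int[parts];
        int assigned = 0;
        for (int i = 0; i < parts; i++) {
            sizes[i] = (int) (total * weights[i] / weightSum); // rounds down
            assigned += sizes[i];
        }
        // Give the rounding remainder to the last (largest) partition
        // so that the sizes always sum to `total`.
        sizes[parts - 1] += total - assigned;
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(partitionSizes(700, 4)));
    }
}
```

Under this reading the four source indexes are deliberately uneven, so the compactor is exercised on a mix of small and large inputs rather than four equal slices.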
| * Handles writing the compacted graph index to disk, managing header, node records, | ||
| * upper layers, and footer in the on-disk format. | ||
| */ | ||
| private static final class CompactWriter implements AutoCloseable { |
Maybe break this out. The parent file is already very large.
Extracted CompactWriter into its own top-level file.
| public final class SystemStatsCollector { | ||
| private static final Logger log = LoggerFactory.getLogger(SystemStatsCollector.class); | ||
| private static final String SCRIPT = String.join("\n", |
Perhaps this logic shouldn't be broken out as it is. Instead of invoking a shell wrapper, it should probably be direct reads and pattern matching in Java.
Thanks for the comment. The bash ProcessBuilder is replaced with a ScheduledExecutorService that reads /proc/cpuinfo, /proc/meminfo, /proc/loadavg, and /proc/diskstats directly via java.nio.file.Files. Same JSONL output format.
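A minimal sketch of that approach, for readers who have not seen the new file. It covers only /proc/meminfo and /proc/loadavg, and the class name, method names, and JSON field names are illustrative assumptions rather than the actual SystemStatsCollector code or its output schema.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the rewritten collector; Linux-only because it reads /proc.
final class ProcStatsSketch {
    /** Parses "Key:   12345 kB" lines from /proc/meminfo into key -> value in kB. */
    static Map<String, Long> parseMemInfo(List<String> lines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) continue; // not a key/value line
            String[] fields = line.substring(colon + 1).trim().split("\\s+");
            try {
                out.put(line.substring(0, colon).trim(), Long.parseLong(fields[0]));
            } catch (NumberFormatException ignored) {
                // skip lines whose value is not a plain number
            }
        }
        return out;
    }

    /** The first field of /proc/loadavg is the 1-minute load average. */
    static double parseLoadAvg1(String line) {
        return Double.parseDouble(line.trim().split("\\s+")[0]);
    }

    /** Renders one JSONL record; the field names here are made up for the sketch. */
    static String toJsonLine(long ts, Map<String, Long> mem, double load1) {
        return String.format(Locale.ROOT,
                "{\"ts\":%d,\"memAvailableKb\":%d,\"load1\":%.2f}",
                ts, mem.getOrDefault("MemAvailable", -1L), load1);
    }

    /** Samples every `periodSeconds` on a daemon thread and prints one JSON line per sample. */
    static ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "proc-stats");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleAtFixedRate(() -> {
            try {
                Map<String, Long> mem = parseMemInfo(Files.readAllLines(Path.of("/proc/meminfo")));
                double load1 = parseLoadAvg1(Files.readString(Path.of("/proc/loadavg")));
                System.out.println(toJsonLine(System.currentTimeMillis(), mem, load1));
            } catch (IOException | RuntimeException e) {
                // Swallow failures: an exception escaping the task would cancel all later samples.
                System.err.println("stats sample failed: " + e);
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

Keeping the parsers as pure functions over strings means they can be unit-tested with canned input on any platform, while only `start` touches the real /proc files.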
- Extract CompactWriter into its own file to reduce OnDiskGraphIndexCompactor size
- Rewrite SystemStatsCollector to read /proc files directly in Java instead of spawning bash
- Clarify recall section description in docs/compaction.md
This PR addresses #580 by adding `OnDiskGraphIndexCompactor`, a streaming N:1 compaction algorithm for merging multiple on-disk HNSW graph indexes into a single compacted index. For a full description of the algorithm and benchmarking instructions, see docs/compaction.md and benchmarks-jmh/src/main/java/io/github/jbellis/jvector/bench/CompactorBenchmark.md.
- Support `-Xmx5g` even for 10M-vector, 2560-dim datasets
- `FixedBitSet` per source excludes deleted nodes from output
- `OrdinalMapper` maps each source's local ordinals to a contiguous global ordinal space; implemented `OffsetMapper` that handles the common sequential case

Usage
Key Changes
- `OnDiskGraphIndexCompactor`: core compaction algorithm with parallel ForkJoinPool execution and backpressure windowing
- `PQRetrainer`: balanced proportional sampling + sequential sorted reads for efficient codebook retraining
- `GraphSearcher`, `GraphIndexBuilder`, and `PQVectors` required by the compactor
- `CompactorBenchmark`: JMH benchmark with `PARTITION_AND_COMPACT`, `PARTITION_ONLY`, `COMPACT_ONLY`, and `BUILD_FROM_SCRATCH` modes

Recall
Comparison against build-from-scratch (results averaged over three runs).
Recall is generally comparable to build-from-scratch and often better, though some datasets show small drops. All datasets compact successfully under `-Xmx5g`; compaction has also been validated on a 2560-dim 10M-vector dataset under the same constraint.
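For reviewers new to the ordinal-mapping and deletion-filtering ideas listed in the description, here is a rough sketch of the sequential case. It uses `java.util.BitSet` in place of jvector's `FixedBitSet`, and `LocalToGlobal`, `offsets`, and `liveGlobalOrdinals` are hypothetical names rather than the PR's `OrdinalMapper` or `OffsetMapper` API. It also assumes offsets are derived from each source's full size, so deleted nodes leave gaps; whether the real mapper closes those gaps is not stated here.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative only: not the interfaces introduced by this PR.
final class OrdinalMappingSketch {
    /** Hypothetical stand-in for an ordinal mapper. */
    interface LocalToGlobal {
        int toGlobal(int sourceIndex, int localOrdinal);
    }

    /** Sequential case: source i's ordinals start where sources 0..i-1 end. */
    static LocalToGlobal offsets(int[] sourceSizes) {
        int[] base = new int[sourceSizes.length];
        for (int i = 1; i < base.length; i++) {
            base[i] = base[i - 1] + sourceSizes[i - 1];
        }
        return (source, local) -> base[source] + local;
    }

    /** Global ordinals of every live (non-deleted) node, visited source by source. */
    static int[] liveGlobalOrdinals(BitSet[] live, LocalToGlobal mapper) {
        int count = 0;
        for (BitSet bits : live) count += bits.cardinality();
        int[] out = new int[count];
        int k = 0;
        for (int s = 0; s < live.length; s++) {
            // nextSetBit skips cleared bits, i.e. deleted nodes are never emitted.
            for (int o = live[s].nextSetBit(0); o >= 0; o = live[s].nextSetBit(o + 1)) {
                out[k++] = mapper.toGlobal(s, o);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(); a.set(0); a.set(2); // source 0: node 1 deleted
        BitSet b = new BitSet(); b.set(1);           // source 1: node 0 deleted
        int[] ordinals = liveGlobalOrdinals(new BitSet[]{a, b}, offsets(new int[]{3, 2}));
        System.out.println(Arrays.toString(ordinals));
    }
}
```

Because the loop only ever touches one source's bitset and a few integers at a time, this style of traversal fits the streaming, bounded-heap goal described above.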